(54) System for distributed multiprocessor communication 



(57) A tightly coupled communication scheme 
based on a common shared resource circuit and adapt- 
ed particularly to a multiprocessing system including 2 N 
CPUs. A mechanism has been added that allows data 
in a shared register to be read and incremented as a 
single instruction, eliminating the need for semaphore 
manipulations during the operation. A second mecha- 
nism has been added to permit the use of indirect ad- 
dressing in the addressing of semaphore bits and 
shared registers. Operating systems can relocate sem- 
aphore bits and message areas to permit simultaneous 
execution of the same function within a single task. In 
addition, an instruction has been added which tests 
the semaphore bit and acts upon the state of that bit. If 
the semaphore bit is not set then the processor takes 
control of the semaphore bit by setting it. If the sema- 
phore bit is set, the processor will execute a branch and 
execute other instructions. Thus, jobs assigned to a 
processor in a multiprocessing, multitasking application 
do not block or wait for the semaphore bit to clear. 
Description 

Background of the Invention 

Field of the invention 

The present invention pertains to the field of high 
speed digital data processors and more particularly, to 
communication between processors in a multiprocessor 
system. 

Background information 

Interprocessor communication is an important fac- 
tor in the design of effective multiprocessor data 
processing systems for multitasking applications. Sys- 
tem processors must be able to execute independent 
tasks of different jobs as well as related tasks of a single 
job. To facilitate this, processors of a multiprocessor sys- 
tem must be interconnected in some fashion so as to 
permit programs to exchange data and synchronize ac- 
tivities. 

Synchronization and data transfers between inde- 
pendently executing processors typically are coordinat- 
ed through the use of controlled access message boxes. 
A single bit semaphore is used to prevent simultaneous 
access to the same message box. In operation, a proc- 
essor tests the state of the semaphore bit. If the sema- 
phore bit is set, the message box is currently "owned" 
by another processor. The requesting processor must 
then wait until the semaphore is cleared, at which time 
it sets the semaphore and can access the message box. 

A typical approach to interprocessor communica- 
tion in prior art machines was to use main memory as 
the location of the message boxes and their associated 
semaphore bits. This "loosely coupled" approach mini- 
mizes interprocessor communication links at the cost of 
increasing the overhead for communications. However,
when the number of processors in a multiprocessing
system increases, processors begin to contend for
limited resources. For instance, accessing a "global" loop
count stored in main memory and used to track
iterations of a process executed by a number of different
processors is relatively simple when there are only two
or three processors. But in a loosely coupled system a
processor's access to a global loop count contends with 
other processors' accesses to data in memory. These 
contentions delay all memory requests. 

A different approach was disclosed in U.S. Patent
No. 4,836,942 issued to Chen et al. and in U.S. Patent
No. 4,754,398 issued to Pribnow, both of which patents
are hereby incorporated herein by reference. The above
documents disclose "tightly coupled" communication
schemes using dedicated "shared" registers for storing
data to be transferred and dedicated semaphores for
protection of that data. Shared registers are organized
to provide N + 1 "clusters" where N equals the number
of processors in the system. Clusters are used to restrict
access to sets of shared registers. Processors are
assigned to a cluster as part of task initialization and can
access only those shared registers that reside in their
cluster. A semaphore register in each cluster
synchronizes access to cluster registers by processors
assigned to the same cluster.

Tightly coupled communication schemes reduce
communication overhead by separating interprocessor
communication from the accesses to memory that occur
as part of the processing of a task. However, even in
tightly coupled systems, communication overhead
increases as a function of the number of processors in a
system. This increased overhead directly impacts
system performance in multitasking applications. A large
number of processors contending for a piece of data
(such as a global loop count) can tie up even a dedicated
communications path due to increased message traffic.
This has been recognized and steps have been
proposed to streamline communications in a tightly coupled
system.

Patent No. 4,754,398 discloses a method for reducing
interprocessor communication traffic incurred in
executing semaphore operations in a tightly coupled
system. A copy of a cluster's global semaphore register is
kept in a local semaphore register placed in close prox- 
imity to each processor in the cluster. Operations on a 
cluster's global semaphore register are mirrored in op- 
erations on the local semaphore registers associated 
with that cluster. The use of a local semaphore register 
reduces the delay between the issuance of a sema- 
phore test command and the determination of the state 
of that semaphore. 

Commonly owned, copending application No. 
07/308,401 by the present inventor goes a step further 
by streamlining the local semaphore testing and by re- 
placing the shared real time clock circuit with distributed 
local real time circuits. That application also extends the 
tightly coupled design to a system of eight processors. 
It is hereby incorporated by reference. 

In the above system the shared semaphore and in- 
formation register circuit is partitioned such that one 
byte of the 64 bit interprocessor communication system 
is located on each processor board. The bytes are
distributed such that the least significant byte of each
information register resides on CPU0 and the most
significant byte on CPU7. Interprocessor communication
commands are a single byte in length; these commands 
are replicated at the source so as to send the same com- 
mand byte to each shared circuit in the system. 

Global semaphore registers for the above system 
are distributed among the processors. Since each
semaphore register is only 32 bits wide, the least significant
byte of each semaphore register is kept on CPU4 and 
the most significant byte is kept on CPU7. 

A local control circuit is placed on each processor
board. This circuit receives an interprocessor
communication instruction from the processor on the board and
determines when to issue the instruction to the shared
communication circuitry. In addition, the control circuit
knows the cluster that the processor is assigned to and
keeps a copy of the semaphore register associated with
that cluster in its local semaphore register.

By software convention, a CPU wishing to access
a shared information register must gain control of the
semaphore associated with that register. First, the CPU
issues a Test_and_Set instruction on the semaphore. If
the bit is set, the local circuit halts the CPU until the bit
clears and there are no other higher priority
interprocessor communication requests. The local circuit then
allows issue of the Test_and_Set instruction and the
proper semaphore is set in the shared semaphore register
and in each local semaphore register assigned to that
cluster.

Once the semaphore bit is set the CPU can access
its associated information register by issuing a
Shared_Register_Read or Shared_Register_Write
instruction. Upon completion of the necessary operations
on the shared register, the CPU clears the semaphore
bit in the shared semaphore register, and the proper bit
in each local semaphore register assigned to that cluster
is cleared. While the semaphore bit is set no other
processor can access the associated information
register.

As the number of processors increases, the methods
disclosed to date are not adequate to meet the needs
of systems having an increased number of processors.
The steps required to access and control global
variables such as loop counts stored in shared registers add
a significant burden to communications overhead. In the
meantime, access to these registers by other
processors in the cluster is not permitted. Processors requiring
access to the loop count must wait until the semaphore
bit is cleared. This has the potential to waste a
considerable amount of CPU time.

It is clear that further changes are necessary in the 
design of a tightly coupled communication circuit to 
achieve reduced message traffic. 

Summary of the Invention 

The present invention is an implementation of a
tightly coupled communication scheme adapted
particularly to, but without limitation thereto, a system
including 16 CPUs. According to the present invention a
mechanism has been added that allows data in a shared
register to be read and incremented as a single
instruction, thus eliminating the need for semaphore
manipulations during the operation.

According to another aspect of the present
invention, an instruction has been added which tests the
semaphore bit and acts upon the state of that bit. If the
semaphore bit is not set then the processor takes control
of the semaphore bit by setting it. If the semaphore bit
is set, the processor will execute a branch and execute
other instructions. Thus, the job does not block or wait
for the semaphore bit to clear.



According to yet another aspect of the present in- 
vention, a mechanism has been added to permit the use 
of indirect addressing in the addressing of semaphore 
bits and shared registers. Operating systems can relo- 
cate semaphore bits and message areas to permit si- 
multaneous execution of the same function within a sin- 
gle task. 

Brief Description of the Drawings 

Fig. 1 is a high-level block diagram of a tightly cou- 
pled multiprocessor system according to the present in- 
vention. 

Fig. 2 is a block diagram of the common shared reg- 
ister resource circuitry according to the present inven- 
tion. 

Fig. 3 is a simplified schematic block diagram of the 
local shared register access circuitry according to the 
present invention. 

Fig. 4 is a table illustrative of a write operation ac- 
cording to the present invention. 

Fig. 5 is a table illustrative of an I/O channel oper- 
ation according to the present invention. 

Detailed Description of the Preferred Embodiment 

In the following detailed description of the preferred
embodiment, reference is made to the accompanying
drawings which form a part hereof, and in which is shown
by way of illustration a specific embodiment in which the
invention may be practiced. The preferred embodiment
of the present invention is designed to operate within a
tightly coupled multiprocessor system of sixteen
processors. It is to be understood that other embodiments
may be utilized and structural changes may be made
without departing from the scope of the present
invention.

FIG. 1 illustrates a high-level block diagram of the
tightly coupled multiprocessor communication system
200 within a multiprocessor data processing system.
Processors 202.1 through 202.N are connected to local
control circuits 10.1 through 10.N, respectively. Local
control circuits 10.1 through 10.N are connected in turn
through shared register write paths 72.1 through 72.N
and shared register read paths 74.1 through 74.N to
shared resource circuit 70. In the preferred
embodiment, paths 72 and 74 are 64 bits wide. Also, in the
preferred embodiment, each local control circuit is placed
on its associated processor's circuit board to ensure
close proximity. This permits use of a separate
instruction path and separate 64 bit address register and scalar
register read and write paths to connect processors 202
to local control circuits 10.

Further, in the preferred embodiment shared re- 
source circuit 70 is partitioned by bit-slicing the registers 
in circuit 70 into N equal subcircuits and duplicating the 
control circuits so as to create N autonomous subcircuits 
71.1 through 71.N. One subcircuit 71 is then placed on
a circuit board with a processor 202 and a local control
circuit 10, reducing the number of circuit boards in the 
system. 

FIG. 2 illustrates a shared resource subcircuit 71 for
a multiprocessing system containing sixteen
processors. Four bit lines from shared register write paths 72.1
through 72.16 are connected through write selector 76
to write registers latch 78. The four bits sent to each
subcircuit 71 depend on the processor board that the
subcircuit is placed on. In the preferred embodiment,
subcircuit 71 placed on Processor M receives bits (M*4)
through (M*4)+3 as shown in FIG. 4.
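
The bit-slicing rule above can be illustrated with a minimal Python sketch. It models only the stated mapping (subcircuit M holds bits M*4 through (M*4)+3 of a 64-bit word) for the sixteen-processor case. The function names `slice_word` and `join_nibbles` are hypothetical and are not taken from the specification.

```python
NUM_SUBCIRCUITS = 16  # sixteen-processor case described in the text
NIBBLE_BITS = 4       # four bit lines per subcircuit


def slice_word(word):
    """Split a 64-bit word into 16 nibbles; entry M holds bits M*4 .. M*4+3."""
    return [(word >> (m * NIBBLE_BITS)) & 0xF for m in range(NUM_SUBCIRCUITS)]


def join_nibbles(nibbles):
    """Recombine the per-subcircuit nibbles into one 64-bit word."""
    word = 0
    for m, nibble in enumerate(nibbles):
        word |= (nibble & 0xF) << (m * NIBBLE_BITS)
    return word


word = 0x0123456789ABCDEF
slices = slice_word(word)
```

Under this mapping the subcircuit on processor board 0 sees the least significant nibble and the subcircuit on board 15 sees the most significant one.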

In a like manner, read registers latch 88 is
connected to four bit lines from shared register read paths 74.1
through 74.16 for transferring data from the shared
resource circuit 70 to the requesting processor 202.

Write registers latch 78 is connected to global 
shared registers 90, through command demultiplexer 84 
to command decoder 80, through read registers selector 
92 to read latch 88 and through I/O channel demultiplex- 
er 86 to one or more I/O channels (not shown). I/O chan- 
nel multiplexer 96 and shared registers 90 are also con- 
nected through read registers selector 92 to read latch 
88. In addition shared registers 90 are connected to read 
and increment circuit 94 for automatically incrementing 
the contents of a register within shared registers 90. 

In the preferred embodiment, shared registers 90 
are segmented into N+1 clusters of sixteen information 
registers (eight shared B and eight shared T) and one 
semaphore register. Shared B registers are used to 
transfer addresses; shared T registers are used to
transfer scalar data. Access to registers within each cluster
is limited to those processors 202 that are assigned to
that cluster. 

Command decoder 80 is connected to write selector 
76, shared registers 90, read selector 92 and read and 
increment circuit 94. Command decoder 80 decodes 
commands received from local control circuits 10.1 
through 10.16 and controls the movement of data within
resource subcircuit 71. Command decoder 80 also
provides feedback to local control circuits 10.1 through
10.16 so they can modify their local semaphore regis- 
ters to reflect changes in shared semaphore registers. 
In addition, command decoder 80 controls operation of 
the attached I/O channel. 

Shared register write paths 72.1 through 72.N
transmit commands and data to shared resource
circuit 70. In the preferred embodiment, commands are
either eight or twelve bits in length. Therefore, since each
subcircuit 71 runs independently, the local control circuit
10 sending the command must replicate and send it to
each of the subcircuits 71.1 through 71.N. For the
sixteen processor case, the first four bits of a command
from processor 202.1 are transferred on write path 72.1
to each subcircuit 71.1 through 71.16 at the same time.
Then the next four bits are transferred, followed by the
next four bits of command and data if required. Each
subcircuit then reconstructs the command using
command demultiplexer 84 before presenting the command
to command decoder 80.
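
The command replication described above can be sketched in Python as follows. This is an illustrative model only: it assumes the least significant nibble is sent first (consistent with the description of FIG. 4 later in the text), and the helper names `send_command` and `receive_at` are hypothetical.

```python
NUM_SUBCIRCUITS = 16  # sixteen-processor case


def command_to_nibbles(command, num_nibbles):
    """Split an 8- or 12-bit command into nibbles, least significant first (assumed order)."""
    return [(command >> (4 * i)) & 0xF for i in range(num_nibbles)]


def broadcast_nibble(nibble):
    """Place one replica of the nibble in every 4-bit lane of the 64-bit write path."""
    word = 0
    for m in range(NUM_SUBCIRCUITS):
        word |= nibble << (4 * m)
    return word


def send_command(command, num_nibbles):
    """Return the 64-bit words driven on the write path, one per clock period."""
    return [broadcast_nibble(n) for n in command_to_nibbles(command, num_nibbles)]


def receive_at(subcircuit, path_words):
    """Model of the demultiplexer on one subcircuit: rebuild the command from its lane."""
    command = 0
    for i, word in enumerate(path_words):
        command |= ((word >> (4 * subcircuit)) & 0xF) << (4 * i)
    return command


words = send_command(0xA5C, 3)  # a hypothetical twelve-bit command
```

Because every lane carries the same nibble, each subcircuit reconstructs an identical command without any communication between subcircuits.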

Local control circuits 10.1 through 10.N arbitrate
among themselves to prevent more than one access to
shared resource circuit 70 at a time. A local control
circuit 10 uses a CPU_In_Progress line 32 to indicate that
it has control of shared resource 70. In the preferred
embodiment, each shared resource subcircuit 71.1 through
71.N is connected to a CPU_In_Progress line 32 from
each local control circuit 10.1 through 10.N. The
resulting N*N lines are used by the command decoder 80 on
each subcircuit 71 to select (through write selector 76)
the write path 72 associated with the requesting
processor 202.

FIG. 3 shows an electrical block diagram of the local
control circuit 10 of FIG. 1. Issue control 16 is connected
to current instruction parcel (CIP) register 12, local
semaphore register 18, semaphore selector 22, command
generator 20 and, externally, to each of the other control
circuits 10 and to shared resource circuit 70. Issue
control 16 manages the issuance of instructions having to
do with shared resource circuit 70. Through CIP register
12, issue control 16 receives instructions from its
respective processor 202. Issue control 16, in turn, acts
through semaphore index selector 24 to steer
semaphore selector 22 with the contents of either CIP register
12 or of a processor 202 address register. The selected
semaphore bit can then be tested by issue control 16 in
the execution of a test and set instruction.

Issue control 16 generates a shared resource
request 30 to each of the other local control circuits 10 and
arbitrates received resource requests 34 from the other
local circuits 10. Once it has gained control of shared
resource circuit 70, issue control 16 asserts a
CPU_In_Progress line 32 to shared resource 70 and
causes command generator 20 to generate a command
based on the contents of CIP register 12. In the
preferred embodiment, the resulting command is
multiplexed by command multiplexer 26 into two to three
nibbles (four bits each) and sent to each subcircuit 71 of
shared resource circuit 70.

Command generator 20 is connected to CIP
register 12, to processor 202 address registers and, through
command multiplexer 26, to write data selector 44. Write
data selector 44 routes data from processor 202 scalar
and address registers, from address register multiplexer
47 and from command multiplexer 26 through local write
data latch 45 to write data path 72.

Data coming from read path 74 is latched in local
read data latch 46. Real time clock 58 is connected to
read data latch 46 to facilitate broadcast loading of an
arbitrary start time. Read data selector 60 is connected
to read data latch 46 directly and through read data
demultiplexer 50 and to real time clock 58. Data from read
data selector 60 can be stored to local semaphore
register 18 or to processor 202 scalar and address
registers. Semaphore register 18 can be loaded directly from
selector 60 or modified one bit at a time through local
semaphore modifier 14. Local semaphore modifier 14
is connected in turn to command decoder 80 for
monitoring activity in the shared semaphore registers.

Issue control 16 controls movement of data through 
control circuit 10. Instructions are stored in CIP register 
12 until issue control 16 determines that shared re- 
source circuit 70 is ready to accept a command. Issue 
control 16 also controls data output by semaphore index
selector 22, write data selector 44 and read data
selector 60 through selector control 33.

As in the previously referred to copending
application by the present inventor, each processor 202 is
assigned a cluster number as part of loading an executable
task into the processor. When the task is loaded,
processor 202 registers the cluster number and requests
and loads the semaphore register associated with that
cluster into its local semaphore register 18. From that
point on, the local control circuit 10 associated with that
processor 202 maintains a copy of the assigned
cluster's shared semaphore register in its local semaphore
register 18.

Shared semaphore registers are used to
synchronize activity and to restrict access to shared information
registers. In one typical operation, an access to shared
information registers begins with processor 202 issuing
a "test and set" command to local control circuit 10.
Local control circuit 10 then checks the status of the
appropriate bit in its local semaphore register 18. If the bit
is set, then another processor has control of that shared
register and processor 202 waits for the bit to be cleared.
If the bit is not set, local control circuit 10 asserts its
CPU_In_Progress line 32 to each of the shared
resource subcircuits 71 and sends a command to set the
bit in the semaphore register for that cluster.

By software convention, setting a bit in the shared 
semaphore register grants control circuit 10 access to 
the associated shared information register. Control cir- 
cuit 10 then has exclusive control to read or write that 
register. Upon finishing, control circuit 10 clears the set 
semaphore bit and another processor can access the 
register. 

In the present invention, a new command has been
added to further improve the efficiency of the computing
system. Where in past machines a processor such as
processor 202 tested a semaphore bit and then was
required to wait until it cleared, the new command tests
the semaphore bit, returns the status and branches to
alternate instructions on determining that the bit is set.
This frees up CPU cycles that were otherwise wasted
waiting for access to a shared register shared by many
CPUs.

This new "test and set or branch" instruction is use- 
ful at the operating system level in providing alternatives 
to just sitting and waiting for a system resource to free 
up. In previous systems, if two CPUs attempted to use 
the system resource, one CPU would gain control of the 
resource and the other would wait until it was finished. 
With the new instruction the second CPU can test for
availability of the system resource. If the resource is
busy, it can continue performing operating system
functions. This permits a polling approach to system
resources rather than the previous "get it or wait"
approach.
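
The difference between "get it or wait" and the polling approach can be sketched in Python. This is a software model under stated assumptions, not the hardware instruction itself: the class `SemaphoreRegister`, the method `test_and_set_or_branch`, and the function `run_task` are hypothetical names, and the bit numbering within the register is assumed.

```python
class SemaphoreRegister:
    """Toy model of one semaphore register."""

    WIDTH = 32  # register width assumed for this sketch

    def __init__(self):
        self.bits = 0

    def test_and_set_or_branch(self, index):
        """Return True if the bit was clear and is now set (caller owns it).

        Return False if the bit was already set; the caller then branches
        to other work instead of waiting.
        """
        if not 0 <= index < self.WIDTH:
            raise ValueError("semaphore bit index out of range")
        mask = 1 << index
        if self.bits & mask:
            return False
        self.bits |= mask
        return True

    def clear(self, index):
        """Release the semaphore bit."""
        self.bits &= ~(1 << index)


def run_task(sem, index, log, name):
    """Try to take the resource; on failure record that other work was done."""
    if sem.test_and_set_or_branch(index):
        log.append(f"{name}: acquired")
    else:
        log.append(f"{name}: busy, doing other work")


sem = SemaphoreRegister()
log = []
run_task(sem, 3, log, "CPU0")
run_task(sem, 3, log, "CPU1")
```

In this model the second caller returns immediately with a "busy" result, which corresponds to taking the branch rather than stalling until the bit clears.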

Semaphore registers are 32 bits wide. To test a bit 
in local semaphore register 18, the contents of CIP reg- 
ister 12 are used to steer the appropriate bit through 
semaphore bit selector 22 to issue control 16. If the bit
is clear, issue control 16 asserts a shared resource
request 30 to each local control circuit 10 and compares
its request to requests 34 received from other local
control circuits 10. In the preferred embodiment, it has been
determined that optimal access to shared resource cir- 
cuit 70 is obtained when priority in accessing shared re- 
source circuit 70 is granted to the processor 202 with 
the lowest CPU number while requiring that a processor 
202 cannot assert a request as long as there is an active 
request 34 pending from a processor 202 with a higher 
CPU number. That is, in a sixteen processor system, 
CPU15 has the highest priority in making a request while
CPU0 has the highest priority in getting an active re- 
quest served. This provides an equal opportunity for all 
processors 202 to access shared resource 70. Once a 
request line 30 is set it remains set until the circuit 10 
has completed its function, for example, until the data is 
transferred in a write operation or until the control infor- 
mation including the register address has been trans- 
ferred to circuit 70 in a read operation. 
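
The two-part arbitration rule can be sketched as a pair of predicates in Python. This is a toy model of the stated rule only (a request may be raised only when no higher-numbered CPU has one active, and among active requests the lowest CPU number is served first). It ignores clock timing, and the names `may_assert` and `next_served` are hypothetical.

```python
def may_assert(cpu, active):
    """A CPU may raise its request only if no higher-numbered CPU has one active."""
    return not any(other > cpu for other in active)


def next_served(active):
    """Among active requests, the lowest CPU number is served first."""
    return min(active) if active else None


# Scenario: CPU0, CPU5 and CPU9 raise requests in that order. Each is allowed,
# because no higher-numbered request is active at the moment it asserts.
active = set()
allowed = []
for cpu in (0, 5, 9):
    allowed.append(may_assert(cpu, active))
    active.add(cpu)

# CPU3 now wants access but must hold off: CPU5 and CPU9 are active.
cpu3_allowed = may_assert(3, active)

# Active requests drain lowest-numbered first.
service_order = []
while active:
    winner = next_served(active)
    service_order.append(winner)
    active.discard(winner)
```

The two opposing priorities interact so that a low-numbered CPU cannot repeatedly re-enter ahead of higher-numbered CPUs that are already waiting.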

Once a processor 202 has obtained access to the
shared registers, command generator 20 is activated by
issue control 16 to generate, in accordance with the
operation specified in CIP register 12, two to three nibbles
of command. This command is sent to each resource
subcircuit 71 where it is received by command decoder
80 and used to control and accomplish the sought after
operation. Command multiplexer 26 takes the first
nibble generated by command generator 20 and sends
sixteen replicas of that nibble on the sixty-four bit wide write
path 72. This is followed in subsequent clock periods by
sixteen replicas of the remaining command nibbles. The
active CPU_In_Progress line 32 causes command
decoder 80 on each subcircuit 71 to select the write path
72 associated with the processor 202 controlling the
shared register access. Each write registers latch 78 of
each of the subcircuits 71 of FIG. 2 simultaneously
receives the first four bits of the command followed in
subsequent clock periods by the remaining nibbles. The
command nibbles are reconstructed into a command in
command demultiplexer 84 and presented to command
decoder 80 for disposition. The command decoders 80
on the subcircuits 71 thus simultaneously
receive the control information necessary to control
shared register access and, in particular, the addressing
of the shared registers in shared registers 90.

In the preferred embodiment of the present inven- 
tion, shared register and real time clock commands are 
two nibbles each. I/O, semaphore and cluster number
commands are three nibbles each.

An example of a read operation will be described.
As mentioned above, access to a shared register
typically begins with a "test and set" instruction aimed at
gaining control of the register. The local control circuit
10 associated with that processor 202 receives the
instruction. It checks the local semaphore bit. If the bit is
clear, control circuit 10 checks to see if a processor with
higher CPU number has a request pending. If so, issue
control 16 waits until the request clears before
generating its own request. If not, issue control 16 generates a
request. Next, issue control 16 checks its request
against requests pending by other processors with a
lower CPU number. If there are requests from
processors with lower CPU numbers pending, issue control 16
waits until those requests clear. Once there are no
requests from processors with lower CPU numbers, issue
control 16 sets the CPU_In_Progress line 32 to each of
the subcircuits 71 and activates command generator 20
to generate a command based on the contents of CIP
register 12. The command generated contains the
location of the bit in the semaphore register that is to be set.
Multiplexer 26 replicates the three nibbles of the
command and broadcasts them to each subcircuit 71 in
successive clock periods.

Each subcircuit 71 contains a list of the clusters and 
the processors currently assigned to each cluster. This 
list is updated each time a processor is assigned to a 
new cluster. The command decoder 80 in each subcir- 
cuit 71 decodes the command and sets the appropriate 
bit in the shared semaphore register associated with the 
cluster the processor is assigned to. In addition, each 
command decoder 80 generates a signal to each local 
semaphore modifier 14 assigned to that cluster so that 
the copy of the shared semaphore register in its local
semaphore register 18 is updated. 

Once the semaphore bit is set, processor 202
issues a "read registers" instruction. The local control 10
generates a request as above. Once it has gotten
control of shared resource 70, issue control 16 sets the
CPU_In_Progress line 32 to each of the subcircuits 71
and activates command generator 20 to generate a
command based on the contents of CIP register 12. The
two nibble command includes the address of the desired
register in shared registers 90. Multiplexer 26 again
generates two nibbles that are sent to each subcircuit 71 in
successive clock periods. Command decoder 80 in
each subcircuit 71 decodes the command, reads the
addressed register in the cluster the processor is assigned
to, and writes the contents to read latch 88. Read latch
88 on each subcircuit 71 writes its four bit nibble to read
path 74.1 through 74.N such that the four bits from each
subcircuit 71 combine to form a single sixty-four bit word
on each read path 74. This word is latched into read data
latch 46 on the requesting local control circuit 10 and
sent through selector 60 to the appropriate scalar or
address register.

In a like manner, a write operation is performed on
shared registers 90 beginning with distribution of the two
control nibbles to each subcircuit 71 but followed on the
next succeeding clock period by transmission of data
from a selected address register Aj, a selected scalar
register S, or the output of multiplexer 47. A write
operation for a sixteen processor system is illustrated in FIG.
4. Since four bits of write path 72 are connected to each
subcircuit 71, four bits of the sixty-four bit data word are
written into write latch 78 and from there into shared
registers 90. As can be seen in FIG. 4, in the first clock
period, the four least significant bits of the command are
transferred to the subcircuit 71 located on each
processor board. In the next clock period, the remaining four
bits of the command are transferred and in the following
clock period the word to be written is transferred, with
the bits distributed as shown in FIG. 4. Again, the
destination cluster is determined by looking at the list of
processor cluster assignments and the destination
register is determined from the command.

The present embodiment permits indirect address- 
ing of registers in shared resource 70. The ability to re- 
assign registers is useful because operating systems 
can relocate semaphore bits and message areas to per- 
mit simultaneous execution of the same function within 
a single task. 

In the preferred embodiment, instructions issued by
processor 202 for shared resource access contain a
three bit j field and a three bit k field. In previous
machines the k field was concatenated to the end of the
two least significant bits of the j field to form a pointer to
the location of the semaphore bit for a semaphore
instruction. This convention is still used in the present
embodiment on semaphore instructions in which the most
significant bit j2 is cleared. However, if the most
significant bit j2 of the j field is set, indirect addressing is
enabled. This means the k field becomes a pointer to an
address register Ak. Address register Ak then contains
the location of the semaphore bit that is to be acted
upon.

In a like manner, in previous machines the j field
was used to form an address to a register in the shared
resource circuit for a register instruction. If the least
significant bit k0 of the k field is cleared in an instruction
according to the present embodiment, this convention
still holds. However, if the least significant bit k0 of the
k field is set in a register instruction, the j field forms a
pointer to an address register Aj. Address register Aj
then contains the address of the register to be accessed.
In either case, for indirect addressing, the contents of
the address register become part of the command
transmitted to shared resource 70.
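
The two decoding rules for direct and indirect addressing can be sketched in Python. This is a hedged illustration: the function names are hypothetical, the address registers are modeled as a plain list of eight integers, and the direct register case simply returns the j field, since the text does not spell out how j forms the register address.

```python
def _check_fields(j, k):
    """Both instruction fields are three bits wide."""
    if not (0 <= j < 8 and 0 <= k < 8):
        raise ValueError("j and k must be three-bit values")


def semaphore_bit_index(j, k, address_regs):
    """Locate the semaphore bit for a semaphore instruction."""
    _check_fields(j, k)
    if j & 0b100:
        # j2 set: indirect addressing, address register Ak holds the location.
        return address_regs[k]
    # j2 clear: two low bits of j followed by the three bits of k.
    return ((j & 0b011) << 3) | k


def shared_register_address(j, k, address_regs):
    """Locate the shared register for a register instruction."""
    _check_fields(j, k)
    if k & 0b001:
        # k0 set: indirect addressing, address register Aj holds the address.
        return address_regs[j]
    # k0 clear: direct addressing (assumed here: j itself is the register number).
    return j


A = [40, 41, 42, 43, 44, 45, 46, 47]  # hypothetical address register contents
direct_bit = semaphore_bit_index(0b010, 0b101, A)
indirect_bit = semaphore_bit_index(0b110, 0b101, A)
```

With j2 clear the five-bit result spans 0 through 31, which matches the 32-bit width of the semaphore registers.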

A significant feature of the present embodiment is 
its ability to increment the contents of a shared B register 
"on the fly". This is important in eliminating steps re- 
quired to increment a loop count in a task in which iter- 
ations of a loop are being performed by more than one 
processor. In previous machines, in order to perform a 
read and increment, a processor was required to issue
a "test and set" instruction to grab control of the
necessary shared B register. This was followed by issuing a
"read register" instruction to read the contents of the
register and place it in a processor register. There the
processor performed the increment and then issued a "write
register" instruction to place the loop count back in the
original shared B register. Finally, the processor cleared the
semaphore bit.

In the present embodiment, this array of instructions
has been replaced with a single "read and increment"
instruction. The "read and increment" instruction causes
read and increment circuit 94 to capture the loop count
as it is read from shared registers 90, increment it and
write the result back into the same shared B register.
This operation is performed as a single sequence of
events, eliminating contention from processors seeking
the same variable and, therefore, removing the require-
ment to grab control of the register via a "test and set"
semaphore command. The "read and increment" func-
tion leads to a savings in clock periods that offers
significant advantages in multiprocessing applications.
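
As a software analogy only, the following sketch shows several workers sharing a loop count through a single read-and-increment operation. The class `SharedBRegister` and the function `run_loop` are hypothetical names; the internal lock merely stands in for the hardware performing the read, increment and write-back as one sequence of events, and the 32-bit mask is an assumption based on the register width.

```python
import threading


class SharedBRegister:
    """Toy model of a shared B register with a read-and-increment operation."""

    def __init__(self, value=0):
        self._value = value
        # Stand-in for the hardware's single, uninterruptible sequence of events.
        self._lock = threading.Lock()

    def read_and_increment(self):
        with self._lock:
            old = self._value                      # value returned to the requester
            self._value = (old + 1) & 0xFFFFFFFF   # incremented value written back
            return old


def run_loop(total_iterations, workers=4):
    """Let several workers claim loop iterations from one shared counter."""
    counter = SharedBRegister()
    claimed = [[] for _ in range(workers)]

    def worker(slot):
        while True:
            i = counter.read_and_increment()
            if i >= total_iterations:
                break
            claimed[slot].append(i)  # a real task would execute iteration i here

    threads = [threading.Thread(target=worker, args=(n,)) for n in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(i for part in claimed for i in part)


if __name__ == "__main__":
    done = run_loop(1000)
    print(len(done), done[:5])
```

Each iteration is claimed by exactly one worker, and no worker performs a separate test and set, read, write and clear sequence around the counter.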

In the preferred embodiment, the bit-slicing of 
shared resource 70 into subcircuits 71 means that each 
read and increment circuit 94 must propagate its carry 
to its next most significant neighbor. In reality, due to the 
speed with which the calculation must be performed in 
order to save the result, it is necessary to generate a
propagate line that is sent to all cards with bits more
significant than the current card. Since the shared B reg-
isters are limited to 32 bits located on processor boards
0 through 7, this means that CPU0 must generate a
propagate to CPU1 through CPU7 and CPU7 must be
capable of accepting up to seven propagate lines and
determining from them if it must perform an increment
of its internal four bits. Since it is desirable for the proc-
essor boards to be identical, the basic processor board
must be able to handle any combination of up to seven
Carry_Ins and seven Carry_Outs.
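
A minimal sketch of the bit-sliced increment follows, assuming that a slice asserts its propagate line when its four bits are all ones and that a slice increments only when every less significant slice is propagating. That condition is the usual lookahead rule and is an assumption here, as are the function names; the text states only which boards send and receive propagate lines.

```python
SLICES = 8           # processor boards 0 through 7
BITS_PER_SLICE = 4   # four bits of the 32-bit register per board
MASK = (1 << BITS_PER_SLICE) - 1


def split(value):
    """Split a 32-bit value into eight 4-bit slices, least significant first."""
    return [(value >> (BITS_PER_SLICE * i)) & MASK for i in range(SLICES)]


def join(slices):
    """Reassemble the 32-bit value from its slices."""
    return sum(s << (BITS_PER_SLICE * i) for i, s in enumerate(slices))


def sliced_increment(value):
    slices = split(value)
    # Propagate line from each slice to all more significant slices
    # (assumed asserted when the slice holds all ones).
    propagate = [s == MASK for s in slices]
    result = []
    for i, s in enumerate(slices):
        # Slice 0 always increments; slice i increments only if all of the
        # i propagate lines it receives are asserted.
        if all(propagate[:i]):
            s = (s + 1) & MASK
        result.append(s)
    return join(result)


if __name__ == "__main__":
    print(hex(sliced_increment(0x0000000F)))
```

Every slice decides from its incoming propagate lines alone, so no carry has to ripple board by board; the most significant slice examines seven lines, matching the CPU7 case described above.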

In the preferred embodiment, command decoder 80 
contains the circuitry necessary to individually control 
the I/O channels associated with the processor 202 on 
whose board it resides. Command decoder 80 gener-
ates I/O control signals and I/O demultiplexer 86 pro-
vides I/O addresses. Since each I/O address is 32 bits
wide and only four bits can be transferred to a subcircuit 
71 at a time, a multiplexing scheme is used in which the 
I/O address is transferred four bits at a time for eight 
consecutive clock periods. Operation of an I/O channel 
is illustrated for the sixteen processor case in FIG. 5. On
the first three clock periods, the command nibbles are
broadcast to all subcircuits 71. As illustrated, the second
and third nibble transmitted contain the I/O channel 
number obtained from an address register Aj. The index 
j is determined from the j field in the instruction in CIP 
register 12. Following that broadcast, in the subsequent 
eight clock periods, the I/O address is broadcast four 
bits at a time to all subcircuits 71. The I/O address is 
retrieved from an address register Ak. Again, the index
k is determined from the k field in the same instruction
in CIP register 12. Each subcircuit 71 examines the I/O
channel number received and determines if the channel
number belongs to a channel on its processor board. If
so, command decoder 80 on that processor board acti-
vates the channel and transfers the received I/O ad-
dress to that channel.
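
The eleven-period transfer can be modeled roughly as below. The nibble ordering (most significant first), the opcode value and all function names are assumptions for illustration; the text specifies only that three command nibbles are followed by eight address nibbles and that the second and third command nibbles carry the channel number.

```python
def to_nibbles(value, count):
    """Split value into `count` 4-bit nibbles, most significant first (assumed order)."""
    return [(value >> (4 * (count - 1 - i))) & 0xF for i in range(count)]


def from_nibbles(nibbles):
    """Rebuild an integer from nibbles given most significant first."""
    value = 0
    for n in nibbles:
        value = (value << 4) | n
    return value


def io_write_sequence(opcode, channel, io_address):
    """Nibbles broadcast over eleven consecutive clock periods."""
    command = [opcode & 0xF] + to_nibbles(channel, 2)  # periods 1-3: command nibbles
    address = to_nibbles(io_address, 8)                # periods 4-11: 32-bit I/O address
    return command + address


def receive(sequence, local_channels):
    """What one subcircuit does with the broadcast: accept only its own channels."""
    channel = from_nibbles(sequence[1:3])
    if channel not in local_channels:
        return None                                    # channel is on another board
    return channel, from_nibbles(sequence[3:11])


if __name__ == "__main__":
    seq = io_write_sequence(0x9, 0x2A, 0xDEADBEEF)     # 0x9 is a made-up opcode
    print(len(seq), receive(seq, {0x2A}), receive(seq, {0x01}))
```

Every subcircuit sees the same broadcast, but only the one whose board owns the channel acts on the reassembled address.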

In a like manner an I/O address can be read from 
an I/O channel, formed into eight nibbles by multiplexer 
98 and read back through read registers latch 88. This
I/O interface functionality gives each subcircuit 71 the 
ability to control the I/O channels on its processor board. 

In the preferred embodiment, a real time clock cir-
cuit 58 is provided within each local control circuit 10.
Clock circuit 58 can be read by an instruction placed in
CIP register 12 or loaded through read data latch 46 with
the contents of a processor 202 scalar register Sj (where
the index j is determined from the instruction in CIP reg-
ister 12). Real time clock circuit 58 can only be loaded
through shared resource circuit 70. Data from a scalar
register Sj on one of the processors 202.1 through 202.N
is written through write registers latch 78 and read reg-
isters selector 92 to read registers latch 88. From there
it is broadcast to the clock circuit 58 on each of local
circuits 10.1 through 10.N. The new starting time is load- 
ed to each of the real time clock circuits 58 within the 
same clock period. 

Although the present invention has been described 
with reference to the preferred embodiments, those 
skilled in the art will recognize that changes may be 
made in form and detail without departing from the spirit 
and scope of the invention. 



Claims

1. In a method of accessing data in an information reg-
ister in a tightly coupled interprocessor communica-
tion system for a multiprocessor data processing
system; wherein said communication system com-
prises a separate communications path (72, 74), a
common shared resource circuit (70) connected to
said path and distributed local control means (10)
connected to each of a plurality of processors (202)
and to the communications path (72, 74) for com-
municating and coordinating data transfer between
said shared resource circuit and the connected
processor; wherein said shared resource circuit
includes common registers (90) including shared
semaphore registers and shared information regis-
ters and wherein said local control means includes
a local semaphore register (18) whose contents
mirror the contents of an associated shared sema-
phore register, wherein the method is of the type
wherein a local semaphore register bit associated
with a desired shared information register is tested
and, if the local semaphore register bit is not set, a
bit is set in the associated shared semaphore reg-
ister corresponding to the local semaphore register
bit, the desired shared information register is
accessed through the local control means and the
associated shared semaphore register bit is
cleared, the improvement characterized by:

if, when the local semaphore bit is tested, the
local semaphore register bit is found to be set,
branching and executing instructions starting at a
branch address.

2. In an interprocessor communication system in
which access to a memory location (90) within a
shared resource circuit (70) is shared by a plurality
of processors (202), a method of automatically
applying a function independent of the processors
to modify the data in said location, wherein the func-
tion is applied as a result of a "read and modify"
instruction received from one of said processors,
the method comprising the steps of:

reading the data in said location;

capturing the data within the shared resource
circuit as it is being sent to a requesting proc-
essor;

performing the function on the captured data to
form a result;

storing the result back to said location.

3. An interprocessor communication system for a mul-
tiprocessor data processing system, comprising:

(a) a shared resource circuit including a plural-
ity of clusters, each cluster including a shared
semaphore register and a plurality of shared
information registers;

(b) the shared resource circuit further including
access control means for limiting access by
each processor to the registers within a single
cluster and auto increment means for automat-
ically incrementing data read from one of said
registers and storing the result back into the
same register;

(c) each processor including means for issuing
instructions to access the common semaphore
and information registers in said shared
resource circuit;

(d) local control means connected to each proc-
essor and in relatively close proximity to its
respective processor as compared to said com-
mon circuit, wherein said local control means
includes a local semaphore register and issue
control means for monitoring and controlling
the issue of instructions requiring access to
said shared resource circuit from the proces-
sor; and

(e) each of said local control means further
including data control means for the transfer of
data from its respective processor to a common
register or from a common register to its
respective processor.

4. The interprocessor communication system accord-
ing to claim 3 wherein each local control means fur-
ther includes command means for developing a
command based on an issued instruction from its
respective processor, said command being sent to
said shared resource circuit in order to gain access
to said shared circuit by the processor.

5. The interprocessor communication system accord- 
ing to claim 4 wherein each command means 
includes shared register address means for indirect 
addressing of the shared registers with the contents 
of a register connected to its respective processor. 

6. The interprocessor communication system accord-
ing to claim 5 wherein each local control means fur-
ther includes a real time clock circuit accessible by
its respective processor.

7. The interprocessor communication system accord-
ing to claim 6 wherein each local control means fur-
ther includes separate read and write paths con-
nected to said shared resource circuit.

8. The interprocessor communication system accord- 
ing to claim 7 wherein each processor further 
includes address registers and scalar registers and 
each write path includes multiplexer means for 
selectively placing the contents of one of said com- 
mand means, said address registers and said sca- 
lar registers on said data path. 

9. The interprocessor communication system accord- 
ing to claim S wherein the shared resource circuit 
further includes I/O channel communication means 
for linking the shared resource circuit to an I/O chan- 
nel and the local control circuit further includes 
means to transfer address information to said I/O 
channel communication means. 

10. A method of forming an interprocessor communica- 
tion system for transferring data and synchronizing 
activity between processors in a multiprocessor 
data processing system of N processors, compris- 
ing: 

providing a common shared resource circuit 
including shared semaphore registers and 
shared information registers, wherein the 
shared information registers are usable for 
holding data to be accessed by a processor and 
the shared semaphore registers are usable for 
controlling access to a shared information reg- 
ister and for synchronizing activity between two 
processors; 






partitioning the resource circuit into resource
circuit blocks such that 1/N of the bits in each
information register is placed in each block and
a block is placed in relatively close proximity to
each processor as compared to the other proc-
essors;

providing local control means connected to
each processor, wherein said local control
means is placed in relatively close proximity to
its respective processor as compared to the
remaining processors and said local control
means includes issue control means and com-
mand means for coordinating communication
between its respective processor and the
shared resource circuit; and

providing interprocessor communication
means for transferring commands and data
between said local control means and said
common shared resource circuit.

11. The method according to claim 10 wherein the 
method further includes providing command means 
connected to each local control circuit and each 
processor for forming an interprocessor communi-
cation command from an issued instruction.

12. The method according to claim 11 wherein the step 
of providing command means further includes pro-
viding means for forming a shared register address
in said interprocessor communication command
from the contents of a register attached to its
respective processor.

13. The method according to claim 12 wherein the 
method further comprises: 

dividing said information registers into clusters, 
assigning one of said semaphore registers to 
each cluster; and 

restricting access by each processor to those
information and semaphore registers in their 
cluster. 

14. The method according to claim 13 wherein the step
of dividing the information registers into clusters fur-
ther includes dividing the information registers into
N+1 clusters, wherein each cluster contains the 
same number of information registers. 

15. The method according to claim 14 wherein the step
of dividing the information registers into clusters fur-
ther includes restricting the number of information
registers in each cluster to sixteen, wherein eight
registers are used for scalar data and eight registers
are used for address data.

16. An interprocessor communication system for a mul- 
tiple processor computing system, comprising: 



a shared information register; 
a shared semaphore register including a bit 
used to control access to said shared informa- 
tion register; 

a plurality of local circuits, wherein a local circuit 
is placed in close proximity and connected to 
an associated processor and wherein a local 
circuit includes 

a current instruction parcel register for 
receiving instruction parcels from the asso- 
ciated processor; 
a real time clock; 
a local semaphore register; 
shared semaphore register monitoring 
means for monitoring changes in the 
shared semaphore register and reflecting 
those changes in the local semaphore reg- 
ister; 

local semaphore testing means for testing 
a bit in said local semaphore register; 
an instruction issue control connected to
said local semaphore testing means and to
each of the other local circuits for monitor-
ing requests for interprocessor communi-
cation from other local circuits and for ena-
bling the issue of instructions from the cur-
rent instruction parcel register as a function
of the state of a bit tested in its local sem-
aphore register and of the requests
received from other local circuits; and
control generation means connected to 
said current instruction parcel and said 
instruction issue control for converting 
issued instructions into a command, said 
control generation means including regis- 
ter address means for indirect addressing 
of the shared registers with the contents of 
a register connected to its respective proc- 
essor; and 

interprocessor communication means con- 
nected to said plurality of local circuits, said 
shared information register and said shared 
semaphore register for transferring a command 
from one of said local circuits to said shared 
registers in order to perform one of a group of 
functions including 

reading the shared information register; 
writing the shared information register; and 
loading the contents of the semaphore reg- 
ister into the local semaphore registers. 

17. The interprocessor communication system accord- 
ing to claim 16 wherein the shared information reg- 
ister includes autoincrement means for automati- 
cally incrementing data read from said register and 
storing the result back into the register and the
group of functions performed by a command further 
includes reading and incrementing the contents of 
the shared register. 


18. The interprocessor communication system accord-
ing to claim 16 wherein the system further com-
prises I/O channel means connected to said inter-
processor communication means for reading and
writing to an I/O channel.